Stack Overflow


StackEval: Benchmarking LLMs in Coding Assistance

Neural Information Processing Systems

We assess LLMs' proficiency as judges for coding tasks using a curated, human-annotated dataset, exploring their evaluation capabilities and potential biases, including whether they favor their own generated solutions. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance.




AI coding is now everywhere. But not everyone is convinced.

MIT Technology Review

AI coding is now everywhere. But not everyone is convinced. Developers are navigating confusing gaps between expectation and reality. So are the rest of us. Depending on who you ask, AI-powered coding is either giving software developers an unprecedented productivity boost or churning out masses of poorly designed code that saps their attention and sets software projects up for serious long-term maintenance problems. The problem is that, right now, it's not easy to know which is true. As tech giants pour billions into large language models (LLMs), coding has been touted as the technology's killer app. Both Microsoft CEO Satya Nadella and Google CEO Sundar Pichai have claimed that around a quarter of their companies' code is now AI-generated. And in March, Anthropic's CEO, Dario Amodei, predicted that within six months 90% of all code would be written by AI.



Automating API Documentation with LLMs: A BERTopic Approach

Naghshzan, AmirHossein

arXiv.org Artificial Intelligence

Developers rely on API documentation, but official sources are often lengthy, complex, or incomplete. Many turn to community-driven forums like Stack Overflow for practical insights. We propose automating the summarization of informal sources, focusing on Android APIs. Using BERTopic, we extracted prevalent topics from 3.6 million Stack Overflow posts and applied extractive summarization techniques to generate concise summaries, including code snippets. A user study with 30 Android developers assessed the summaries for coherence, relevance, informativeness, and satisfaction, showing improved productivity. Integrating formal API knowledge with community-generated content enhances documentation, making API resources more accessible and actionable.
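The extractive-summarization step described above can be illustrated with a minimal frequency-based sentence scorer. This is a hypothetical stand-in for intuition only, not the paper's pipeline, which runs BERTopic topic modeling over the 3.6 million posts before summarizing:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Pick the n highest-scoring sentences, preserving original order.

    A sentence's score is the summed corpus frequency of its words --
    a deliberately simple proxy for the extractive techniques the
    paper applies to Stack Overflow posts.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # Rank sentence indices by score, keep the top n, restore order.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
        reverse=True,
    )
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

# Usage: the sentence repeating frequent terms ("the API") wins.
summary = extractive_summary("Use the API. The API returns JSON data. Thanks.", 1)
```

In a real system the scorer would operate per BERTopic topic cluster rather than over raw text, but the select-and-reorder structure is the same.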





FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Thakur, Nandan, Lin, Jimmy, Havens, Sam, Carbin, Michael, Khattab, Omar, Drozdov, Andrew

arXiv.org Artificial Intelligence

We introduce FreshStack, a holistic framework for automatically building information retrieval (IR) evaluation benchmarks by incorporating challenging questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not improve first-stage retrieval accuracy (two out of five topics) and oracle context helps an LLM generator generate a high-quality RAG answer. We hope FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.
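A common way to realize the "fusion of retrieval techniques" mentioned in step (3) is reciprocal rank fusion (RRF), which merges ranked lists from different retrievers without needing comparable scores. The sketch below is an illustrative assumption about such a fusion step, not FreshStack's exact method:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    rankings: list of ranked lists (best document first).
    k: smoothing constant; 60 is the value from the original RRF paper.
    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse a hypothetical lexical (BM25-style) ranking with a
# hypothetical dense-retrieval ranking over the same corpus.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF depends only on ranks, it composes retrievers with incompatible score scales, which is why it is a standard choice for hybrid lexical-plus-dense architectures like those the abstract describes.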